Fixes Indexed assignment of extra columns in SolutionArray #838
Conversation
@@ -523,9 +523,9 @@ def __init__(self, phase, shape=(0,), states=None, extra=None):
                "Unable to create extra column '{}': name is already "
                "used by SolutionArray objects.".format(name))
            if not np.shape(v):
-               self._extra[name] = [v]*self._shape[0]
+               self._extra[name] = np.array([v]*self._shape[0])
I know this is a pre-existing choice, but it may make sense to allow for more dimensions? I.e., this would be a case where we have something like
>>> states = ct.SolutionArray(gas, (6, 10), extra=['foo', 'bar'])
I can't clearly understand what you want to say. I guess you mean that foo and bar should be extra arrays of size (6, 10)? Like
>>> states = ct.SolutionArray(gas, (6, 10), extra={'foo': sd, 'bar': sd})
where sd is any (6, 10) list/np array?
@sin-ha ... what I wrote was deliberate - no typo there. It currently throws a ValueError, but you could argue that it should initialize np.ndarrays with the full _shape, not just the first dimension. From my perspective, the shape of the _extra entries should be consistent with other data.
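A minimal sketch of what this full-shape initialization could look like. The helper name `init_extra_full_shape` is hypothetical (not Cantera code), and `shape` stands in for the complete `self._shape` tuple rather than just `self._shape[0]`:

```python
import numpy as np

def init_extra_full_shape(extra, shape):
    """Expand scalar extra values to the full SolutionArray shape.

    Hypothetical helper: `shape` is the complete self._shape tuple,
    not just its first dimension as in the current code.
    """
    columns = {}
    for name, v in extra.items():
        if not np.shape(v):
            # scalar value: broadcast over the full shape
            columns[name] = np.full(shape, v)
        else:
            # array-like value: must already match the full shape
            columns[name] = np.asarray(v).reshape(shape)
    return columns

columns = init_extra_full_shape({"foo": 0.0, "bar": 1}, (6, 10))
```

With something along these lines, `ct.SolutionArray(gas, (6, 10), extra=['foo', 'bar'])` could populate (6, 10) arrays instead of raising.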
@ischoegl I'm also not sure what you're suggesting, but it seems like it would be a new feature? In that case, I think it's outside the scope of this fix.
@ischoegl As of now, extra expects dicts unless the shape of SolutionArray is trivial.
- If the user passes a list, we should be able to initialize the extra array with the same shape.
- Otherwise the user will pass it as a dict; in that case we should validate extra against the complete SolutionArray shape instead of self._shape[0].
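A sketch of the dict validation suggested here; `validate_extra` is a hypothetical name, and the point is simply checking against the full shape tuple rather than `self._shape[0]`:

```python
import numpy as np

def validate_extra(extra, shape):
    """Validate dict-supplied extra columns against the full array shape.

    Hypothetical sketch: raises ValueError when a non-scalar value
    does not match the complete `shape` tuple.
    """
    for name, v in extra.items():
        if np.shape(v) and np.shape(v) != tuple(shape):
            raise ValueError(
                "Extra column '{}' has shape {}, expected {}".format(
                    name, np.shape(v), tuple(shape)))

validate_extra({"foo": np.zeros((6, 10))}, (6, 10))  # matching shape passes
```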
I also do think this is more about adding another functionality; it seems a bit off from the issue, so should it be pulled in here, or should we open another issue?
I think the point I raised is a trivial addition, and the application was stated in my first post: it's a reasonable assumption that the arrays are instantiated with correct dimensions based on the information already provided. My guess is that the former single-dimensioned approach was based on _extra formerly being list-based. Now that this is no longer the case, putting in a more consistent approach is trivial.
What I am talking about affects some of the same lines that are changed already and won't take more than a couple of additional lines. Hence, I'd like to see it considered as a friendly amendment of the original issue 😉
if isinstance(extra, str):
    extra = [extra]

if isinstance(extra, list) and all(isinstance(name, str) for name in extra):
That captures almost everything! However, extra could also be an instance of tuple or ndarray of strings? I believe all of these are supported when no shape is used for the instantiation of SolutionArray (i.e., I am talking about the former elif extra and self._shape == (0,): case, in which case any of the above can be iterated over).
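A small sketch of normalizing the name iterable as suggested; the `as_name_list` helper is hypothetical, but str, list, tuple, and ndarray inputs all reduce to a plain list of names:

```python
import numpy as np

def as_name_list(extra):
    """Normalize `extra` to a list of column names, or None if it is
    neither a string nor an iterable of strings."""
    if isinstance(extra, str):
        return [extra]
    if isinstance(extra, (list, tuple, np.ndarray)):
        # np.str_ subclasses str, so ndarrays of strings pass this check
        if all(isinstance(name, str) for name in extra):
            return list(extra)
    return None

print(as_name_list(np.array(["foo", "bar"])))
```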
That'd be a quick fix, but I don't think np.arrays are important? Besides, I believe the other part, where self._shape is trivial, should be removed: after this commit it would only be a desperate attempt to map something into extra, probably not useful (is there a particular use case?). It lacks checks, and something weird like integers, for instance, can pass, which can't be used back. E.g.
>>> states3 = ct.SolutionArray(gas, shape=(0, 0, 0), extra=[1])
I don’t think that the alternative case can be executed for useful content either (you already caught all useful cases), i.e. I believe it can be deleted. If I’m not mistaken, the unit tests should still pass.
On a formatting note, there are a couple of spaces missing after commas (e.g. it should be shape=(0, 0, 0) in the previous post; same for some other lines you edited). It’s minor, but the code should remain in compliance with PEP 8.
PS: re ndarrays ... again it’s minor, but they should be included for consistency.
Yes, the unit tests pass.
Sorry for the formatting, though.
Hi @sin-ha! Thanks for taking this on. I've got several comments below, I think we need to change how this is implemented. Let me know your thoughts, or if you have any questions/concerns!
@@ -590,7 +596,7 @@ def append(self, state=None, **kwargs):
             raise IndexError("Can only append to 1D SolutionArray")

         for name, value in self._extra.items():
-            value.append(kwargs.pop(name))
+            np.append(value,kwargs.pop(name))
Suggested change:
-            np.append(value,kwargs.pop(name))
+            np.append(value, kwargs.pop(name))
But this isn't correct; np.append does not work in-place. See the documentation: https://numpy.org/doc/1.18/reference/generated/numpy.append.html This will most likely be very slow, since NumPy allocates a new array every time np.append() is called. I wonder if it would be better to cast the values in _extra to lists and use the list append method, and then cast them back to arrays? Maybe the complexity isn't worth it, though.
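The not-in-place behavior is easy to demonstrate; the result has to be stored back somewhere (e.g. into self._extra[name]) or the appended value is silently lost:

```python
import numpy as np

value = np.array([1.0, 2.0])
result = np.append(value, 3.0)  # allocates and returns a NEW array

# The original array is unchanged; only the returned copy grew.
print(value)   # [1. 2.]
print(result)  # [1. 2. 3.]
```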
I also can't find any other way around it.
What do you mean, any other way around? You'll have to find some way, since this won't work as expected. I think the simplest method is just to reassign value: value = np.append(value, kwargs.pop(name)). That might be slow, since NumPy has to create a new array every time, but hopefully not too slow.
I forgot to review the test file as well! Sorry, my main comment there was that I think each time you have a comment that delimits a different set of tests, that should be a new test function.
@sin-ha Are you going to be able to address my comments? Thanks 😄
Yes @bryanwweber, actually I was a bit busy with my never-ending university exams and forgot about it.
Thanks @sin-ha, I certainly understand having lots of exams 😊
Force-pushed from edebe9b to d9ec808
Thanks for the changes @sin-ha, just a few things left!
Done @bryanwweber.
Thanks @sin-ha. One change still to do, and please feel free to add yourself to AUTHORS!
@@ -608,7 +623,7 @@ def append(self, state=None, **kwargs):
             raise IndexError("Can only append to 1D SolutionArray")

         for name, value in self._extra.items():
-            value.append(kwargs.pop(name))
+            np.append(value, kwargs.pop(name))
This line still isn't fixed.
I didn't think this through before. Anyway, regarding "I think the simplest method is just to reassign value: value = np.append(value, kwargs.pop(name))": appending to np.ndarrays is not as fast. It turns out that converting them to lists and appending (as you said before) does a better job! See https://stackoverflow.com/a/22394181 and https://stackoverflow.com/questions/7133885/fastest-way-to-grow-a-numpy-numeric-array
In [2]: %%timeit
...: a = np.empty((0,3), int)
...: l = list(a)
...: for i in range(1000):
...: l.append([3*i+1,3*i+2,3*i+3])
...: l = np.array(l)
467 µs ± 6.32 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [3]: %%timeit
...: a = np.array([3,4,3])
...: for i in range(1000):
...: l = list(a)
...: l.append([3*i+1,3*i+2,3*i+3])
...: l = np.array(l)
...:
2.27 ms ± 531 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [4]: %%timeit
...: a = np.empty((0,3), int)
...: for i in range(1000):
...: a = np.append(a, 3*i+np.array([[1,2,3]]), 0)
...:
3.88 ms ± 21.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
@sin-ha I pushed a few more commits to this branch to fix a few edge cases I found in some testing. For your reference, the changes are in commits b86786b, 1cc53b4, and 8786b51. The other commits fix a few other things I found while working on this code. The main things to watch out for are:
The test failures are due to using NumPy 1.11.
Ubuntu 16.04 support ends April 2021, which is less than a year out. Citing a conservative (albeit current) distro, CentOS 8 uses NumPy 1.14. PS: It appears that NumPy requirements are actually not strictly enforced.
Thanks for the thoughts on NumPy @ischoegl. It is really good to know that a conservative distribution like CentOS includes 1.14. Until this PR, there has not been a required minimum version of NumPy; we have been able to be flexible since we hadn't used very many of the user-interface features (basically, we just used the C headers to provide arrays in Cython). Hence, it was only necessary to provide a warning for versions of NumPy that we hadn't tried. Going forward, it seems it may be necessary to enforce an actual minimum version.
The minimum NumPy version is 1.12 to support the use of numpy.full() in the SolutionArray interface introduced in Cantera#838.
From reading the discussion, I think all of the previously-mentioned issues have been hashed out at this point, right? I just had one case where I think the behavior should change.
raise ValueError("Initial values for extra properties must be "
                 "supplied in a dict if the SolutionArray is not "
                 "initially empty")
self._extra[name] = np.empty(self._shape)
I don't think the 'extra' arrays should be allowed to contain uninitialized values. The existing behavior of requiring initial values if the SolutionArray is not empty should be preserved.
@speth ... I’d ask to consider retaining this behavior (while potentially avoiding np.empty). Think of a scenario where extra is calculated based on the content of the SolutionArray itself (which typically receives the actual value using some setter only after it is instantiated, so the content of extra is not known a priori). For this case, you’d have to provide dummy values to create extra columns, which would be subsequently overwritten. I believe that automatically initializing values as zeros would be a lot more convenient (and would avoid np.empty); columns that contain strings are presumably rare, so having to provide a value there would be OK.
@ischoegl I originally had the same objection as you, but I considered it for a while and I ended up agreeing with @speth. There are two main reasons for this, as I see it:
- It is trivially easy to specify a dummy value, since you do not need to specify the shape. For instance:
  >>> arr = ct.SolutionArray(gas, (3, 10), extra={"prop": 0, "prop2": 0.0})
  >>> arr.prop
  array([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
         [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]])
  >>> arr.prop2
  array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
         [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
  >>> # This is not allowed right now, but would be almost the same, except the dtype
  >>> arr = ct.SolutionArray(gas, (3, 10), extra=["prop", "prop2"])
  Since I've used np.full() to fill the initial value, there's no need to give an array as the initial value.
- It isn't clear what dtype should be used: it is not only strings, but whether one wants to use floats vs. ints or any other dtype as well. I suppose floats could be used in place of ints, but there are legitimate use cases for the int dtype, for instance, as a categorical or level variable. I think it would also not be clear why one would have to specify an initial value for some dtypes but not for others.
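The dtype inference that makes the scalar route work can be seen directly with np.full(); the shapes here are just for illustration:

```python
import numpy as np

a = np.full((3, 10), 0)    # int fill value -> integer dtype
b = np.full((3, 10), 0.0)  # float fill value -> float dtype

print(a.dtype.kind, b.dtype.kind)  # i f
```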
@bryanwweber ... I didn’t think of the scalar route, so your argument is convincing.
The availability of the scalar initializer is a nice compromise for the case where the value is going to be calculated anyway.
If I'm reading this block of code correctly now, the only time that this line can be called is if self._shape is (0,), right? It might be more clear if the value was just set to self._extra[name] = np.empty(0) then.
Yes, that's correct, it only runs if self._shape == (0,). I'm not sure it's clearer to just pass 0, though. Shapes of NumPy arrays are always returned as tuples, and just writing np.empty(0) implies to me that 0 will be the value in the array (notwithstanding the name empty), not that there's a shape of 0. So I think it would be better to use np.empty(shape=(0,)), which is then just self._shape, and putting in (0,) I thought would be more confusing than using self._shape, because the question arises: if these are equal, why not just use the existing variable name?
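For reference, the spellings discussed here all produce the same zero-length array, so none of them leaves uninitialized values behind:

```python
import numpy as np

a = np.empty(0)           # integer size
b = np.empty((0,))        # shape tuple
c = np.empty(shape=(0,))  # explicit keyword, as suggested above

# All three are 1-D with zero elements.
print(a.shape, b.shape, c.shape)  # (0,) (0,) (0,)
```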
I think explicitly showing the size as zero (with whatever notation you like) is preferable, because it makes it clear that this doesn't result in an array with uninitialized values. Tracing the logic back to see that self._shape is (0,) if we're in this branch is not really that simple.
OK, this is fixed now with shape=(0,) 👍
Fix the formatting of the two KeyErrors in SolutionArray.append(). The first one was missing the format method on the string. The second was raising a double-KeyError due to the way the except is handled.
When appending to a SolutionArray, any extra values must be specified as kwargs. If they aren't present, the array lengths would become out of sync, or there would be a KeyError on the kwargs dictionary. Also adds a test for the failure modes of SolutionArray.append().
Move append of extra values to the end of SolutionArray.append(). If the append is done at the beginning of the function, the append can happen even if the state is invalid. This would cause the length of the arrays to become out of sync.
If a single value (e.g., integer or float) is passed as the value for an extra column, the array with the single value cannot be reshaped. Use np.full() instead. Add/rename tests for creating extra items by dicts and iterables, along with tests of the failure conditions.
Make sure that ndarrays are one-dimensional. Move check for bare string. Clarify error messages. Reduce indentation. Add tests for creating extra items from bare strings and ndarrays.
This change bumps the builder that uses the system Python 2 to run SCons and the system Python 3 for the Python interface to Ubuntu 18.04 from 16.04. The reason for this change is that Ubuntu 16.04 provides NumPy 1.11, which does not support flexible dtypes when creating arrays using .full(). We have decided to drop support for NumPy older than 1.12 for this reason. Numpy 1.12 was released in January 2017.
If the extra columns are passed to the SolutionArray, they must either have initial values or the SolutionArray must be empty.
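The flexible-dtype limitation mentioned in the commit messages above can be checked directly; on NumPy 1.12 and later, np.full() accepts a string fill value:

```python
import numpy as np

# On NumPy 1.11 creating an array from a string fill value with
# np.full() raised an error; 1.12+ infers a string dtype instead.
labels = np.full((4,), "unset")

print(labels.dtype.kind)  # U
print(labels.tolist())    # ['unset', 'unset', 'unset', 'unset']
```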
Checklist
- Build passes (scons build & scons test) and unit tests address code coverage

If applicable, fill in the issue number this pull request is fixing
Fixes #829
Changes proposed in this pull request
- Earlier, self._extra[name] was initialised as a list, but when the set item method was called it returned an np.array() copy of it.
- So I propose to initialise it as an np.array in the __init__ method and amend the other methods to abide by it.
- I think that this would be a better implementation.